This is an attempted partial reproduction of J. Chakraborty’s 2021 study ‘Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S.’ In it, I reproduce Chakraborty’s workflow for extracting the COVID-19 incidence rate for different disability demographic groups, assessing the correlation between county-level case incidence and the population share of these demographics. This section of the study uses COVID-19 case data from August 1st, 2020. Because this was so early in the pandemic, spread across space plays a substantial role in case incidence. To account for this, Chakraborty creates high-incidence clusters as a spatial variable input to Generalized Estimating Equations (GEEs).
Instead of running GEEs, I’ll deviate from Chakraborty’s work by clustering county-level cases using Kulldorff spatial scan statistics to generate case clusters and assess the spatial scale of COVID-19 spread. This will also serve as a test of the comparability of the SpatialEpi and SaTScan implementations of this statistic, a potentially useful tool for open-source reproductions of epidemiological studies that use Kulldorff statistics.
Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. DOI:[10.1016/j.dhjo.2020.101007](https://doi.org/10.1016/j.dhjo.2020.101007)
- Key words: Epidemiology, Disease Spread, Disabilities, COVID-19
- Subject: Social and Behavioral Sciences: Geography: Human Geography
- Date created: 3/3/25
- Date modified: 3/3/25
- Spatial Coverage: United States of America
- Spatial Resolution: Counties
- Spatial Reference System: EPSG:4326
- Temporal Coverage: March 2020 - August 2020
- Temporal Resolution: All recorded COVID-19 cases from the beginning of the pandemic through August 1st, 2020

- Spatial Coverage: The 49 contiguous states in the U.S.
- Spatial Resolution: Counties
- Spatial Reference System: N/A
- Temporal Coverage: 1/22/2020 - 8/1/2020
- Temporal Resolution: All recorded COVID-19 cases from the beginning of the pandemic through August 1st, 2020

The data on disability and sociodemographic characteristics come from the U.S. Census American Community Survey (ACS) five-year estimates for 2018 (2014-2018).

This work is a partial reproduction study to assess the feasibility of reproducing geographic research. Because the original study likely cannot be replicated perfectly from the available manuscript alone, planned deviations will change the end result of the study. I have two goals for this study:
1. Reach findings similar to Chakraborty’s regarding the correlation between disability and COVID-19 incidence rate.
2. Create county clusters of high COVID-19 incidence using Kulldorff scan statistics to compare to the original SaTScan output.
The original study is an observational study that seeks to answer the question: “Is COVID-19 incidence higher in counties that have a higher percentage of disabled population?” This research question is ultimately broken into five specific hypotheses, investigating whether people with disabilities are affected differently depending on their race, ethnicity, poverty level, age, and biological sex.
The original study was conducted in ArcGIS and SaTScan software; in this reproduction I’ll attempt to match methods using R instead.
All variables in this study were derived from secondary data; there are no experimentally manipulated variables. There are eighteen independent variables: the percentage of total disabled persons per county, and seventeen ‘disaggregated’ categories that break down sociodemographic characteristics of the disabled population. The COVID-19 incidence rate is the dependent variable.
- Title: County Level Poverty and Disability Data, Derived from 2014-2018 American Community Survey 5-year Estimates
- Abstract: This dataset documents poverty and disability data at the county level. The US Census Bureau collects the data in order to produce information on social, economic, housing, and demographic characteristics of the nation’s population every year, which provides an important tool for communities to use to see how they are changing.
- Spatial Coverage: United States
- Spatial Resolution: Counties
- Spatial Representation Type: textTable (tabular data, joined to county polygons by FIPS code)
- Spatial Reference System: N/A (tabular data)
- Temporal Coverage: 2014-2018
- Lineage: We computed the summary statistics, including the min, max, mean, and standard deviation. We processed the data after retrieving it by excluding states that are irrelevant to our study area. We also checked for data duplication and omission: we found one county with missing disability and poverty data, for which we replaced the missing data with zeros.
- Distribution: The data is available for download from the US Census.
- Constraints: The US Census data is open access in the public domain.

The variables in question are as follows, with the associated descriptions from the ACS 5-year estimate data, which is publicly available online:
| Variable Name in Study | ACS Variable name |
|---|---|
| percent of total civilian non-institutionalized population with a disability | S1810_C03_001E |
| Race | |
| percent w disability: White alone | S1810_C03_004E |
| percent w disability: Black alone | S1810_C03_005E |
| percent w disability: Native American | S1810_C03_006E |
| percent w disability: Asian alone | S1810_C03_007E |
| percent w disability: Other race | S1810_C03_009E |
| Ethnicity | |
| percent w disability: Non-Hispanic White | S1810_C03_011E |
| percent w disability: Hispanic | S1810_C03_012E |
| percent w disability: Non-Hispanic non-White | (S1810_C02_001E - S1810_C02_011E - S1810_C02_012E) / (S1810_C01_001E - S1810_C01_011E - S1810_C01_012E) * 100 |
| Poverty | |
| percent w disability: Below poverty level | (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100 |
| percent w disability: Above poverty level | (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100 |
| Age | |
| percent w disability: 5-17 | S1810_C03_014E |
| percent w disability: 18-34 | S1810_C03_015E |
| percent w disability: 35-64 | S1810_C03_016E |
| percent w disability: 65-74 | S1810_C03_017E |
| percent w disability: 75+ | S1810_C03_018E |
| Biological sex | |
| percent w disability: male | S1810_C03_002E |
| percent w disability: female | S1810_C03_003E |
- Title: County Level COVID-19 Incidence Rate
- Abstract: This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).
- Spatial Coverage: United States
- Spatial Resolution: Counties
- Spatial Representation Type: Vector Polygon
- Spatial Reference System: WGS 84 / Pseudo-Mercator
- Temporal Coverage: 1/22/2020 - 8/1/2020
- Temporal Resolution: N/A
- Lineage: The specific dataset used was provided by Chakraborty and has been preprocessed in RStudio for analysis. There is no readily apparent way to access archived data from the Johns Hopkins University Center for Systems Science and Engineering database. Archived data versions can be found at the Johns Hopkins CSSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19); however, the archive only provides summaries at the national scale. We selected columns that are relevant to our analysis to simplify the stored data, retaining POP_ESTIMA and Confirmed. These columns were then renamed to pop and cases respectively. We then calculated the COVID-19 rate by dividing cases by the total population. Finally, the geometry of the dataset was dropped.
- Distribution: The data is publicly available from the JHU website.
- Constraints: The data is open access under the Creative Commons Attribution 4.0 International (CC BY 4.0) license by the Johns Hopkins University on behalf of its Center for Systems Science and Engineering. Copyright Johns Hopkins University 2020.
## Reading layer `covidcase080120' from data source
## `/Users/lucas/Documents/GitHub/Covid_19_Clustering_Original/data/raw/public/covidcase080120.gpkg'
## using driver `GPKG'
## Simple feature collection with 3108 features and 87 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -13885850 ymin: 2817034 xmax: -7452848 ymax: 6340332
## Projected CRS: WGS 84 / Pseudo-Mercator
At the time of this study pre-registration, the authors had prior knowledge of the geography of the study region with regards to the geographic, population, and historical disease spread phenomena to be studied.
Extensive knowledge of the physical and social geography of the United States might mean that some conclusions drawn will be influenced by prior knowledge of movement patterns, etc. Additionally, lived experience throughout the course of the pandemic may bias interpretation of patterns in the data.
On the data side, reporting and testing of cases was notoriously inconsistent across municipalities, states, and regions. This means that some counties’ counts are likely misreported, and that inconsistencies are often correlated (e.g., a whole state de-emphasizing testing, resulting in low case counts).
I will first work with data from the ACS, specifically tables acs_vars_S1810 for disability data and acs5_c18130 for poverty data. Using mutate(), I can create the percentage of the population with disabilities by dividing the number of reported disabilities by the total population for which a disability designation exists.
This is specifically NOT the total population, as the total number of disability vs. no-disability responses is less than the total population of counties.
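As a toy illustration of this denominator choice (hypothetical counts, not actual ACS values):

```r
# Hypothetical county: 95,000 of 100,000 residents have a disability
# determination recorded, and 15,000 of those report a disability.
det_pop   <- 95000    # denominator: population with a disability determination
dis_pop   <- 15000    # reported disabilities
total_pop <- 100000   # full county population

dis_pct <- dis_pop / det_pop * 100
stopifnot(round(dis_pct, 2) == 15.79)

# Using the full population as the denominator would understate the rate
stopifnot(dis_pop / total_pop * 100 < dis_pct)
```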
# Join poverty data to disability data
acs <- acs %>% left_join(acs_pov, by = "GEO_ID")
# calculate percentages
acs_derived <- mutate(acs,
dis_pct = S1810_C02_001E / S1810_C01_001E * 100,
white_pct = S1810_C02_004E / S1810_C01_001E * 100,
black_pct = S1810_C02_005E / S1810_C01_001E * 100,
native_pct = S1810_C02_006E / S1810_C01_001E * 100,
asian_pct = S1810_C02_007E / S1810_C01_001E * 100,
other_pct =
(S1810_C02_008E + S1810_C02_009E + S1810_C02_010E) / S1810_C01_001E * 100,
non_hisp_white_pct = S1810_C02_011E / S1810_C01_001E * 100,
hisp_pct = S1810_C02_012E / S1810_C01_001E * 100,
non_hisp_non_white_pct =
(S1810_C02_001E - S1810_C02_012E - S1810_C02_011E) / S1810_C01_001E * 100,
bpov_pct = (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100,
apov_pct = (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100,
pct_5_17 = S1810_C02_014E / S1810_C01_001E * 100,
pct_18_34 = S1810_C02_015E / S1810_C01_001E * 100,
pct_35_64 = S1810_C02_016E / S1810_C01_001E * 100,
pct_65_74 = S1810_C02_017E / S1810_C01_001E * 100,
pct_75 = S1810_C02_018E / S1810_C01_001E * 100,
male_pct = S1810_C02_002E / S1810_C01_001E * 100,
female_pct = S1810_C02_003E / S1810_C01_001E * 100
)
# select only relevant geographic identifiers and derived percentages. Subsetting strings to select only the last 5 digits, which is the FIPS code format used by the JHU data
acs_derived <- acs_derived %>%
mutate(fips = substr(GEO_ID, nchar(GEO_ID) - 4, nchar(GEO_ID))) %>%
select(
fips,
statefp = STATE.x,
county = NAME.x,
county_st = NAME.x,
contains("pct")
)
# JHU COVID-19 Data
Then, I’ll calculate the case rate per 100,000 people and split the JHU COVID-19 data with rate data into two products: one with and one without geometry. Having geometry associated with the data breaks things a few steps down the line, so it’s easier to remove it in one of the versions now. I’ll then join the JHU COVID-19 case rate data to the ACS data.
covid_mapping <- covid |>
  mutate(covid_rate = round(cases / pop * 100000, 2))
covid_table <- covid_mapping |>
  st_drop_geometry()
Here’s the resulting map, which matches Chakraborty’s Figure 1. Below, I also map disability rates by county, which Chakraborty didn’t do but which helps visualize spatial patterns across counties.
Now I’ll join the JHU dataset to the ACS data by county-level FIPS code, a unique numerical identifier for each county.
# Join COVID incidence rate data to acs data
acs_covid <- covid_mapping %>%
left_join(acs_derived, by = "fips")
# move covid_rate column prior to disability percentages
acs_covid <- acs_covid %>%
select(fips, statefp, county, county_st, covid_rate, everything())
tm_disability_rates <- tm_shape(acs_covid) +
tm_polygons("dis_pct",
title = "Percent of People with Disability\n(ACS 2014-2018)",
style = "quantile",
border.alpha = .2,
lwd = 0.2,
palette = "Oranges"
)+
tm_shape(state) +
tm_borders("grey", lwd = .5) +
tm_layout(
legend.position = c("left", "bottom"),
legend.title.size = 0.8,
legend.text.size = 0.5
)
tm_disability_rates
It’s hard to pick out a discernible correlation/non-correlation from these two maps, so let’s dive into the data.
Deviation from plan: I will also calculate descriptive statistics for each disability subgroup to match the original study’s methodology and results.
Planned deviation for reanalysis: I also calculate the Shapiro-Wilk test for normality, printed in the table with its corresponding p-value.
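As a quick, self-contained sanity check of how the Shapiro-Wilk statistic behaves (toy data, not the study variables): strongly right-skewed data, like county case rates, should be decisively rejected, with W well below 1.

```r
set.seed(1)
skewed <- rexp(500)   # right-skewed, loosely resembling county case rates

sw <- shapiro.test(skewed)
# Strongly non-normal: tiny p-value, W noticeably below 1
stopifnot(sw$p.value < 0.001, sw$statistic < 0.95)
```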
acs_covid_stats <- acs_covid %>%
st_drop_geometry() %>%
select(covid_rate, contains("pct")) %>%
stat.desc(norm = TRUE) %>%
round(4) %>%
t() %>%
as.data.frame() %>%
select(min, max, mean, SD = std.dev, ShapiroWilk = normtest.W, p = normtest.p)
acs_covid_stats %>%
kable(caption = "Reproduced Descriptive Statistics",
align = "c") %>%
column_spec(2:6, width_min = "5em") %>%
column_spec(7, width_min = "2em") %>%
kable_styling(full_width = FALSE)
| min | max | mean | SD | ShapiroWilk | p | |
|---|---|---|---|---|---|---|
| covid_rate | 0.0000 | 14257.1700 | 966.9050 | 1003.9563 | 0.7448 | 0 |
| dis_pct | 3.8261 | 33.7058 | 15.9540 | 4.4028 | 0.9822 | 0 |
| white_pct | 0.8522 | 33.2588 | 13.5496 | 4.6308 | 0.9840 | 0 |
| black_pct | 0.0000 | 20.7024 | 1.4804 | 2.6563 | 0.6057 | 0 |
| native_pct | 0.0000 | 13.7360 | 0.2844 | 0.9442 | 0.2759 | 0 |
| asian_pct | 0.0000 | 3.4531 | 0.0905 | 0.1777 | 0.5053 | 0 |
| other_pct | 0.0000 | 15.2363 | 0.5490 | 0.6525 | 0.5730 | 0 |
| non_hisp_white_pct | 0.1026 | 33.1587 | 12.8358 | 4.8086 | 0.9870 | 0 |
| hisp_pct | 0.0000 | 25.2618 | 0.9855 | 2.1462 | 0.4173 | 0 |
| non_hisp_non_white_pct | 0.0000 | 20.9294 | 2.1327 | 2.7492 | 0.6991 | 0 |
| bpov_pct | 0.0000 | 14.9710 | 3.5689 | 1.8499 | 0.9274 | 0 |
| apov_pct | 3.6526 | 27.2984 | 12.4857 | 3.0542 | 0.9874 | 0 |
| pct_5_17 | 0.0000 | 5.0828 | 1.0327 | 0.4797 | 0.9527 | 0 |
| pct_18_34 | 0.0000 | 5.5864 | 1.5646 | 0.6718 | 0.9568 | 0 |
| pct_35_64 | 1.0121 | 18.3639 | 6.3466 | 2.3024 | 0.9588 | 0 |
| pct_65_74 | 0.0000 | 12.7298 | 3.0912 | 1.1556 | 0.9488 | 0 |
| pct_75 | 0.0000 | 11.1278 | 3.8663 | 1.1916 | 0.9652 | 0 |
| male_pct | 1.2999 | 18.1942 | 8.0579 | 2.3720 | 0.9764 | 0 |
| female_pct | 1.9075 | 19.9434 | 7.8961 | 2.2620 | 0.9815 | 0 |
Results from the Shapiro-Wilk test indicate strong evidence against normality for the majority of the variables used. This motivates affirming the results of the parametric correlation with a non-parametric test as well. To ensure consistency with Chakraborty’s descriptive statistics, I’ll subtract his summary statistics table from mine, with zeros across the board signifying a successful reproduction.
table1 is Chakraborty’s table, reconstructed from his
published paper.
# load original table 1 results
table1 <- read.csv(here("data", "raw", "public", "chakraborty", "table1.csv"))
# subtract original results from reproduced results
(select(acs_covid_stats, min, max, mean, SD) -
select(table1, min, max, mean, SD)) %>%
kable(caption = "Descriptive Statistics Comparison",
align = "c") %>%
column_spec(2:5, width = "4em") %>%
kable_styling(full_width = FALSE)
| min | max | mean | SD | |
|---|---|---|---|---|
| covid_rate | 0.0000 | 0.1700 | -0.0950 | -0.0437 |
| dis_pct | -0.0039 | -0.0042 | 0.0040 | 0.0028 |
| white_pct | 0.0022 | -0.0012 | -0.0004 | 0.0008 |
| black_pct | 0.0000 | 0.0024 | 0.0004 | -0.0037 |
| native_pct | 0.0000 | -0.0040 | 0.0044 | 0.0042 |
| asian_pct | 0.0000 | 0.0031 | 0.0005 | -0.0023 |
| other_pct | 0.0000 | -0.0037 | -0.0010 | 0.0025 |
| non_hisp_white_pct | 0.0026 | -0.0013 | -0.0042 | -0.0014 |
| hisp_pct | 0.0000 | 0.0018 | -0.0045 | -0.0038 |
| non_hisp_non_white_pct | 0.0000 | -0.0006 | 0.0027 | -0.0008 |
| bpov_pct | 0.0000 | 0.0010 | -0.0011 | -0.0001 |
| apov_pct | 3.6526 | -0.0016 | 0.0057 | -0.0058 |
| pct_5_17 | 0.0000 | 0.0028 | 0.0027 | -0.0003 |
| pct_18_34 | 0.0000 | -0.0036 | 0.0046 | 0.0018 |
| pct_35_64 | 0.0021 | 0.0039 | -0.0034 | 0.0024 |
| pct_65_74 | 0.0000 | -0.0002 | 0.0012 | -0.0044 |
| pct_75 | 0.0000 | -0.0022 | -0.0037 | 0.0016 |
| male_pct | -0.0001 | 0.0042 | -0.0021 | 0.0020 |
| female_pct | -0.0025 | 0.0034 | -0.0039 | 0.0020 |
These columns are just slightly off: it looks like Chakraborty may have rounded COVID-19 cases to the whole person, which means that our reproduction is mostly correct but will be slightly skewed. apov_pct is also suspicious, with a higher minimum value in my statistics that doesn’t match table1’s results.
Carrying on…
I will use cor() to determine the correlation between COVID-19 incidence rate and percent disability using Pearson’s correlation coefficient, running it for each disability subgroup. A p-value is then computed for each correlation from its t statistic, t = |r| / sqrt((1 - r^2) / df) with df = n - 2 degrees of freedom, assigning the results to their respective columns with mutate().
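The manual t and p computation can be checked against base R’s cor.test() on toy data. Note that the pt() call is one-tailed, so cor.test()’s two-sided p-value is twice this value.

```r
set.seed(42)
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)

r  <- cor(x, y)
df <- length(x) - 2

# Manual t statistic and one-tailed p-value, as in the chunk below
t_manual     <- abs(r) / sqrt((1 - r^2) / df)
p_one_tailed <- pt(t_manual, df, lower.tail = FALSE)

# Base R's cor.test() computes the same t; its p-value is two-sided
ct <- cor.test(x, y)
stopifnot(isTRUE(all.equal(t_manual, unname(abs(ct$statistic)))))
stopifnot(isTRUE(all.equal(2 * p_one_tailed, ct$p.value)))
```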
df <- sum(!is.na(acs_covid$dis_pct)) - 2
pearsons_r <- acs_covid %>%
select(where(is.numeric)) %>%
st_drop_geometry() %>%
cor(method = "pearson", use = "everything") %>%
as.data.frame() %>%
select(r = covid_rate) %>%
mutate(
t = abs(r) / sqrt((1 - r^2) / (df)),
p = pt(t, df, lower.tail = FALSE)
) %>%
round(3) %>%
rownames_to_column("variable") %>%
dplyr::filter(variable != "covid_rate")
pearsons_r %>%
kable(caption = "Reproduced Pearson's R",
align = "c") %>%
column_spec(2:4, width = "4em") %>%
kable_styling(full_width = FALSE)
| variable | r | t | p |
|---|---|---|---|
| pop | 0.128 | 7.215 | 0.000 |
| cases | 0.209 | 11.891 | 0.000 |
| x | 0.099 | 5.540 | 0.000 |
| y | -0.412 | 25.195 | 0.000 |
| dis_pct | -0.060 | 3.350 | 0.000 |
| white_pct | -0.332 | 19.612 | 0.000 |
| black_pct | 0.460 | 28.847 | 0.000 |
| native_pct | 0.019 | 1.072 | 0.142 |
| asian_pct | 0.094 | 5.272 | 0.000 |
| other_pct | 0.026 | 1.460 | 0.072 |
| non_hisp_white_pct | -0.361 | 21.545 | 0.000 |
| hisp_pct | 0.119 | 6.686 | 0.000 |
| non_hisp_non_white_pct | 0.442 | 27.429 | 0.000 |
| bpov_pct | NA | NA | NA |
| apov_pct | NA | NA | NA |
| pct_5_17 | 0.084 | 4.688 | 0.000 |
| pct_18_34 | 0.063 | 3.493 | 0.000 |
| pct_35_64 | -0.008 | 0.460 | 0.323 |
| pct_65_74 | -0.091 | 5.113 | 0.000 |
| pct_75 | -0.186 | 10.541 | 0.000 |
| male_pct | -0.134 | 7.519 | 0.000 |
| female_pct | 0.023 | 1.305 | 0.096 |
This looks similar to Chakraborty’s final Pearson’s table; to compare it, I will find the difference between each table and the significance levels assigned to different variables.
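The significance-star recoding in the next chunk relies on cut()’s interval conventions (intervals closed on the right by default); a toy check with hypothetical p-values:

```r
p <- c(0.001, 0.03, 0.2)   # hypothetical p-values

# p <= 0.01 -> 2 stars, 0.01 < p <= 0.05 -> 1 star, otherwise 0
stars <- as.numeric(as.character(cut(p,
  breaks = c(-0.1, 0.01, 0.05, 1),
  labels = c(2, 1, 0)
)))
stopifnot(all(stars == c(2, 1, 0)))
```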
# calculate number of significance stars at p < 0.01 and p < 0.05 levels.
pearsons_r <- mutate(pearsons_r, rp_stars = as.numeric(as.character(cut(p,
breaks = c(-0.1, 0.01, 0.05, 1),
labels = c(2, 1, 0)
))))
# join reproduction coefficients to original study coefficients
correlations <- table1 %>%
dplyr::filter(variable != "covid_rate") %>%
dplyr::select(variable, or_r = r, or_stars = stars) %>%
left_join(select(pearsons_r, variable, rp_r = r, rp_stars), by = "variable")
# find difference between coefficient and stars
correlations <- correlations %>%
bind_cols(rename_with(
correlations[, 4:5] - correlations[, 2:3],
~ paste0(.x, "_diff")
))
# find coefficients with different directions
correlations <- correlations %>% mutate(rp_dir_diff = (rp_r > 0) - (or_r > 0))
correlations %>%
kable(caption = "Compare reproduced and original Pearson's R",
col.names = c("Variable", "R", "Sig. Level", "R", "Sig. Level", "R", "Sig. Level", "Direction"),
align = "c") %>%
kable_styling() %>%
add_header_above(c(" " = 1, "Original" = 2, "Reproduced" = 2, "Difference" = 3))
| Variable | R | Sig. Level | R | Sig. Level | R | Sig. Level | Direction |
|---|---|---|---|---|---|---|---|
| dis_pct | -0.056 | 2 | -0.060 | 2 | -0.004 | 0 | 0 |
| white_pct | -0.326 | 2 | -0.332 | 2 | -0.006 | 0 | 0 |
| black_pct | 0.456 | 2 | 0.460 | 2 | 0.004 | 0 | 0 |
| native_pct | 0.020 | 0 | 0.019 | 0 | -0.001 | 0 | 0 |
| asian_pct | 0.097 | 2 | 0.094 | 2 | -0.003 | 0 | 0 |
| other_pct | 0.028 | 0 | 0.026 | 0 | -0.002 | 0 | 0 |
| non_hisp_white_pct | -0.355 | 2 | -0.361 | 2 | -0.006 | 0 | 0 |
| hisp_pct | 0.119 | 2 | 0.119 | 2 | 0.000 | 0 | 0 |
| non_hisp_non_white_pct | 0.439 | 2 | 0.442 | 2 | 0.003 | 0 | 0 |
| bpov_pct | 0.108 | 2 | NA | NA | NA | NA | NA |
| apov_pct | -0.146 | 2 | NA | NA | NA | NA | NA |
| pct_5_17 | 0.083 | 2 | 0.084 | 2 | 0.001 | 0 | 0 |
| pct_18_34 | 0.066 | 1 | 0.063 | 2 | -0.003 | 1 | 0 |
| pct_35_64 | -0.005 | 0 | -0.008 | 0 | -0.003 | 0 | 0 |
| pct_65_74 | -0.089 | 2 | -0.091 | 2 | -0.002 | 0 | 0 |
| pct_75 | -0.181 | 2 | -0.186 | 2 | -0.005 | 0 | 0 |
| male_pct | -0.131 | 2 | -0.134 | 2 | -0.003 | 0 | 0 |
| female_pct | 0.028 | 0 | 0.023 | 0 | -0.005 | 0 | 0 |
This correlation and significance table is the final product of my reproduction of the initial phase of Chakraborty’s disability correlation work in this paper. The final table should match the corresponding table in his manuscript, which it does … mostly. All of the coefficients are slightly off from Chakraborty’s original calculations, although the significance levels are all the same with the exception of pct_18_34, which went up in significance when I ran the test. I am not sure of the source of this discrepancy, but having checked through the workflow thoroughly, it could be attributable to the initial case-rounding difference.
Deviation from published plan: Chakraborty didn’t test the assumption of normality that is required for a Pearson’s correlation test (and the variables are non-normally distributed, as seen in my Shapiro-Wilk tests); to adjust for this, I’ll conduct a bivariate nonparametric correlation analysis (Spearman’s rho) to check the validity of his results.
df <- sum(!is.na(acs_covid$dis_pct)) - 2
spearmans_rho <- acs_covid %>%
select(where(is.numeric)) %>%
st_drop_geometry() %>%
cor(method = "spearman", use = "everything") %>%
as.data.frame() %>%
select(rho = covid_rate) %>%
mutate(
t = abs(rho) / sqrt((1 - rho^2) / (df)),
p = pt(t, df, lower.tail = FALSE)
) %>%
round(3) %>%
rownames_to_column("variable") %>%
dplyr::filter(variable != "covid_rate")
Compare the Spearman’s rho correlation coefficients to the reproduced Pearson’s r correlation coefficients. Differences are calculated as Spearman’s Rho - Pearson’s R.
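Spearman’s rho is simply Pearson’s r computed on ranks, which is why it is robust to the skew flagged by the Shapiro-Wilk tests; a self-contained check on toy data:

```r
set.seed(7)
x <- rexp(50)
y <- x^2 + rexp(50)   # monotone but nonlinear relationship

# Spearman's rho equals Pearson's r applied to the ranks of the data
rho <- cor(x, y, method = "spearman")
stopifnot(isTRUE(all.equal(rho, cor(rank(x), rank(y)))))
```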
# calculate number of significance stars at p<0.01 and P<0.05 levels.
spearmans_rho <- mutate(spearmans_rho, rp_rho_stars = as.numeric(as.character(cut(p,
breaks = c(-0.1, 0.01, 0.05, 1),
labels = c(2, 1, 0)
))))
correlations <- correlations[, 1:8] %>%
left_join(select(spearmans_rho, variable, rp_rho = rho, rp_rho_stars), by = "variable")
corrdiff <- select(correlations, starts_with("rp_rho")) -
select(correlations, rp_r, rp_stars)
correlations <- correlations %>% bind_cols(rename_with(corrdiff, ~ paste0(.x, "_diff")))
rm(corrdiff)
correlations <- correlations %>% mutate(rp_rho_dir_diff = (rp_rho > 0) - (rp_r > 0))
correlations %>%
select(variable, rp_r, rp_stars, starts_with("rp_rho")) %>%
kable(col.names = c("Variable", "R", "Stars", "Rho", "Stars", "Rho - R", "Stars", "Direction"),
align = "c") %>%
#column_spec(2:6, width_min = "5em") %>%
kable_styling() %>%
add_header_above(c(" " = 1, "Pearson's" = 2, "Spearman's" = 2, "Difference" = 3))
| Variable | R | Stars | Rho | Stars | Rho - R | Stars | Direction |
|---|---|---|---|---|---|---|---|
| dis_pct | -0.060 | 2 | -0.113 | 2 | -0.053 | 0 | 0 |
| white_pct | -0.332 | 2 | -0.421 | 2 | -0.089 | 0 | 0 |
| black_pct | 0.460 | 2 | 0.575 | 2 | 0.115 | 0 | 0 |
| native_pct | 0.019 | 0 | -0.084 | 2 | -0.103 | 2 | -1 |
| asian_pct | 0.094 | 2 | 0.194 | 2 | 0.100 | 0 | 0 |
| other_pct | 0.026 | 0 | 0.104 | 2 | 0.078 | 2 | 0 |
| non_hisp_white_pct | -0.361 | 2 | -0.454 | 2 | -0.093 | 0 | 0 |
| hisp_pct | 0.119 | 2 | 0.231 | 2 | 0.112 | 0 | 0 |
| non_hisp_non_white_pct | 0.442 | 2 | 0.481 | 2 | 0.039 | 0 | 0 |
| bpov_pct | NA | NA | NA | NA | NA | NA | NA |
| apov_pct | NA | NA | NA | NA | NA | NA | NA |
| pct_5_17 | 0.084 | 2 | 0.079 | 2 | -0.005 | 0 | 0 |
| pct_18_34 | 0.063 | 2 | 0.034 | 1 | -0.029 | -1 | 0 |
| pct_35_64 | -0.008 | 0 | -0.020 | 0 | -0.012 | 0 | 0 |
| pct_65_74 | -0.091 | 2 | -0.151 | 2 | -0.060 | 0 | 0 |
| pct_75 | -0.186 | 2 | -0.285 | 2 | -0.099 | 0 | 0 |
| male_pct | -0.134 | 2 | -0.201 | 2 | -0.067 | 0 | 0 |
| female_pct | 0.023 | 0 | -0.014 | 0 | -0.037 | 0 | -1 |
Significance changes between the two tests are seen, with native_pct and other_pct being reassigned to the ** significance level and pct_18_34 being bumped down to * significance. This correlation table, which still shows correlation between various disability and demographic categories and COVID-19 spread, is likely a more defensible version of the original study’s correlations.
For the second section of the methods, I’ll attempt to create a map of clusters of counties that have high COVID-19 incidence rates. This is somewhat of a reproduction of Chakraborty’s work, in that he used county clusters as an input to generalized estimating equations to account for spatial autocorrelation in his disability analysis.
To delineate his high-incidence clusters, Chakraborty used SaTScan, spatio-temporal statistical software developed for disease monitoring that uses spherical great-circle distance calculations from the centroid of each county to determine county cluster extents. Chakraborty then used the county clusters as a way to control for spatial variation in disability and COVID-19 rates, something that could also be done with more common geographical techniques like a spatial regression. While potentially effective in this case, the selection of SaTScan as the tool to account for spatial correlation is likely a product of Chakraborty’s epidemiology background, the discipline SaTScan is designed for. To attempt to mimic his clustering results in R, I will use the Kulldorff spatial scan statistic in the SpatialEpi package, a hopefully analogous spatial clustering approach.
To generate clusters, SpatialEpi uses county relative risk (comparing local incidence to the global incidence rate) to cluster counties into regional areas of relatively higher risk, grouping counties that have similarly high risk profiles together. The package returns the primary cluster of high-risk counties and its best estimate of any secondary clusters, if they exist. To assess the effectiveness of this method, I will map the clusters to visually determine whether they accurately model the perceived clustering of high-risk counties.
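With a single stratum, the expected case count used by the scan statistic reduces, as I understand SpatialEpi’s expected() function, to allocating the total case count proportionally to each county’s population share. A base-R sketch with hypothetical counts:

```r
pop   <- c(10000, 50000, 25000)   # hypothetical county populations
cases <- c(12, 80, 20)            # hypothetical case counts

# Expected cases under a constant overall risk (one stratum)
expected_cases <- pop * sum(cases) / sum(pop)

# Expected counts always sum to the observed total
stopifnot(isTRUE(all.equal(sum(expected_cases), sum(cases))))
# County 2 has more cases than expected -> elevated relative risk
stopifnot(cases[2] > expected_cases[2])
```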
start_time <- Sys.time()
covid_geo <- covid_table %>%
select(x, y) %>%
latlong2grid()
# latlong2grid creates approximate equidistant cylindrical grid
# could probably reproject to epsg 5070 and create table with x and y
# calculate expected cases with one strata
expected.cases <- expected(covid_table$pop, covid_table$cases, 1)
# Kulldorff spatial scan statistic
covid_kulldorff <- kulldorff(
geo = covid_geo,
cases = covid_table$cases,
population = covid_table$pop,
expected.cases = expected.cases,
pop.upper.bound = 0.5,
n.simulations = 999,
alpha.level = 0.05,
plot = TRUE
)
print(
paste(
"Run time:",
round(difftime(Sys.time(), start_time, units = "mins"), 2),
"minutes"
),
quote = FALSE
)
# save results in a file appended with the current date
saveRDS(covid_kulldorff,
file = here("data", "derived", "public", paste0("covid_kulldorff_", Sys.Date(), ".RDS"))
)
## [1] Most likely cluster:
## $location.IDs.included
## [1] 1824 1835 1797 1818 1825 1749 1854 1742 1837 1747 1838 280 1760 1846 1756
##
## $population
## [1] 16949211
##
## $number.of.cases
## [1] 469091
##
## $expected.cases
## [1] 233805.6
##
## $SMR
## [1] 2.006329
##
## $log.likelihood.ratio
## [1] 97983.07
##
## $monte.carlo.rank
## [1] 1
##
## $p.value
## [1] 0.001
## [1] Number of Secondary clusters: 132
I’ll first match the FIPS code of each detected cluster to the FIPS codes of counties found in covid_table, the geometry-free data frame version of the JHU COVID-19 data. From these matched clusters and FIPS codes I’ll be able to map the detected clusters qualitatively. This map should match Chakraborty’s clustering data that he input to his GEEs.
# list of primary cluster counties
primary_cluster_fips <- covid_kulldorff$most.likely.cluster$location.IDs.included
clusters <- data.frame(
fips = covid_table[primary_cluster_fips, "fips"],
clusterID = covid_table[primary_cluster_fips[1], "fips"],
likelihood = covid_kulldorff$most.likely.cluster$log.likelihood.ratio,
stringsAsFactors = FALSE
)
# Get a list of secondary clusters
secondary <- covid_kulldorff$secondary.clusters
# similarly add counties in each secondary cluster to the list of clusters
for (i in secondary) {
cluster_locations <- i$location.IDs.included
new_clusters <- data.frame(
fips = covid_table[cluster_locations, "fips"],
clusterID = covid_table[cluster_locations[1], "fips"],
likelihood = i$log.likelihood.ratio,
stringsAsFactors = FALSE
)
clusters <- rbind(clusters, new_clusters)
}
Joining the cluster data back to the acs_covid
dataframe
acs_covid <- acs_covid %>%
left_join(clusters, by = "fips")
acs_covid <- acs_covid |>
  mutate(isCluster = case_when(
    clusterID == fips ~ "center of cluster",
    !is.na(clusterID) ~ "other part of cluster",
    .default = NA
  ))
Planned deviation for reproduction: Map the
SpatialEpi cluster results.
acs_covid <- st_as_sf(acs_covid)
tm_spatialepi_clusters <-
tm_shape(acs_covid) +
tm_polygons(col = "isCluster",
palette = "-Oranges",
popup.vars = c("fips", "clusterID"),
colorNA = "grey",
title = "SpatialEpi Kulldorff COVID-19 Clusters",
border.col = "white",
lwd = 0.2,
border.alpha = 0.2) +
tm_shape(state) +
tm_borders("grey", lwd = .5) +
tm_layout(
legend.position = c("left", "bottom"),
legend.title.size = 0.8,
legend.text.size = 0.5
)
tm_spatialepi_clusters
This map shows our extracted COVID-19 clusters, with the center county of each cluster in red and the other counties within the cluster shown in cream. On first pass, we see a huge cluster in the southeast centered in the Florida Panhandle, spreading through much of the South. We also see several single-county clusters, especially in the more sparsely populated Midwest and West. These are likely on their own because the surrounding regions lack the population density to spread COVID-19 effectively.
I’ll now calculate the relative risk, a product of Chakraborty’s SaTScan analysis but not of the SpatialEpi R package. This will need to be grouped by cluster to match the final SaTScan output.
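The relative risk formula used below compares the rate inside a county (or cluster) to the rate everywhere else; on hypothetical counts:

```r
pop   <- c(1000, 2000, 3000)   # hypothetical county populations
cases <- c(30, 20, 10)         # hypothetical case counts
total_pop   <- sum(pop)
total_cases <- sum(cases)

# RR_i = (rate inside county i) / (rate outside county i)
rr <- (cases / pop) / ((total_cases - cases) / (total_pop - pop))

# County 1: inside rate 0.03 vs outside rate 30/5000 = 0.006 -> RR = 5
stopifnot(isTRUE(all.equal(rr[1], 5)))
stopifnot(rr[1] > 1, rr[3] < 1)
```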
total_pop <- sum(acs_covid$pop)
total_cases <- sum(acs_covid$cases)
acs_covid <- acs_covid %>%
group_by(clusterID) %>%
mutate(
rr_cluster = ifelse(is.na(clusterID), NA,
(sum(cases) / sum(pop)) / ((total_cases - sum(cases)) / (total_pop - sum(pop)))
)
) %>%
ungroup() %>%
mutate(
rr_loc = (cases / pop) / ((total_cases - cases) / (total_pop - pop))
)
Relative risk will be classified on a scale from 1 to 6.
| Relative Risk Values | Relative Risk Class |
|---|---|
| Outside of cluster | 1 |
| RR < 1 | 1 |
| 1 <= RR < 2 | 2 |
| 2 <= RR < 3 | 3 |
| 3 <= RR < 4 | 4 |
| 4 <= RR < 5 | 5 |
| RR >= 5 | 6 |
Counties falling outside of any cluster are assigned a score of 1.
# class breaks
breaks <- c(-Inf, 1, 2, 3, 4, 5, Inf)
acs_covid <- acs_covid %>%
  mutate(
    cluster_class = ifelse(is.na(clusterID), 1, cut(rr_cluster, breaks, labels = FALSE)),
    loc_class = cut(rr_loc, breaks, labels = FALSE)
  )
Let’s visualize these relative risk scores spatially!
First, map the spatial distribution of local relative risk score classifications by county.
# count the frequency of counties in each class and create labels
class_freq <- acs_covid %>% st_drop_geometry() %>% count(loc_class)
class_freq$qual <- ifelse(class_freq$n > 1, " counties", " county")
class_freq[1, ]$qual <- paste(class_freq[1, ]$qual, "at low risk")
class_freq[nrow(class_freq), ]$qual <- paste(class_freq[nrow(class_freq), ]$qual, "at high risk")
class_freq$label <- paste0(class_freq$loc_class, " (", class_freq$n, class_freq$qual, ")")
# Map Local Relative Risk scores
tm_spatialepi_local_risk_class <- tm_shape(acs_covid) +
  tm_polygons("loc_class",
              title = "Local Relative Risk Class",
              border.col = "white",
              border.alpha = 0.2,
              lwd = 0.2,
              palette = "Oranges",
              style = "cat",
              labels = class_freq$label
  ) +
  tm_shape(state) +
  tm_borders("grey", lwd = 0.5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )
tm_spatialepi_local_risk_class
Next, I map the cluster relative risk scores for comparison. Note that,
following the original study's classification methodology, counties
outside of clusters are assigned the lowest risk class of 1.
# count the frequency of counties in each class and create labels
class_freq <- acs_covid %>% st_drop_geometry() %>% count(cluster_class)
class_freq$qual <- ifelse(class_freq$n > 1, " counties", " county")
class_freq[1, ]$qual <- paste(class_freq[1, ]$qual, "at low risk")
class_freq[nrow(class_freq), ]$qual <- paste(class_freq[nrow(class_freq), ]$qual, "at high risk")
class_freq$label <- paste0(class_freq$cluster_class, " (", class_freq$n, class_freq$qual, ")")
# map cluster relative risk scores
tm_spatialepi_cluster_risk_class <- tm_shape(acs_covid) +
  tm_polygons("cluster_class",
              title = "Cluster Relative Risk Class",
              border.col = "white",
              border.alpha = 0.2,
              lwd = 0.2,
              palette = "Oranges",
              style = "cat",
              labels = class_freq$label
  ) +
  tm_shape(state) +
  tm_borders("grey", lwd = 0.5) +
  tm_layout(
    legend.position = c("left", "bottom"),
    legend.title.size = 0.8,
    legend.text.size = 0.5
  )
rm(class_freq)
tm_spatialepi_cluster_risk_class
We can see that the cluster relative risk does a decent job of describing the overall relative risk, with our large Southeast cluster capturing much of the higher risk across the region. Isolated clusters in the Midwest also capture spatial differences in relative risk, with different risk levels seen in adjacent clusters.
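One way to quantify this agreement would be to cross-tabulate the two class columns. A minimal sketch on fabricated data (on the real data, drop the geometry with `st_drop_geometry()` first, then tabulate the same two columns):

```r
# Fabricated risk classes for six hypothetical counties
toy <- data.frame(
  loc_class     = c(1, 2, 2, 3, 6, 1),
  cluster_class = c(1, 2, 3, 3, 5, 1)
)
table(toy$loc_class, toy$cluster_class)   # confusion-style cross-tab
mean(toy$loc_class == toy$cluster_class)  # share of exact matches: 4/6
```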
To properly compare the SpatialEpi results to the SaTScan output, we should know how many clusters we have. I’m going to count them and identify the cluster with the largest number of included counties. The original study found 102 unique clusters with between 1 and 255 counties in each cluster.
# summarize clustering results
cluster_summary <- acs_covid %>%
  drop_na(clusterID) %>%
  dplyr::filter(cases > 0) %>%
  st_drop_geometry() %>%
  count(clusterID)
cat(
length(cluster_summary$n),
"unique clusters based on spatialEpi CLUSTER relative risk\n"
)
## 133 unique clusters based on spatialEpi CLUSTER relative risk
summary(cluster_summary$n)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 7.023 2.000 599.000
rm(cluster_summary)
We failed to reproduce the SaTScan output, producing 133 unique clusters with between 1 and 599 counties, many more than the original output. This is an interesting result, potentially pointing to our implementation of the Kulldorff scan statistic differing from SaTScan’s. One hypothesis is that it clusters more sensitively at lower levels of risk, as seen in the Southeast megacluster (clusterID 12005), centered on Bay County, Florida. At the same time as we see more preferential grouping into clusters, we also see the sheer number of clusters increase. This suggests that SpatialEpi is, in addition to being more willing to group counties together, happy to initialize more clusters.
Because this study seeks both to A) reproduce the first part of Chakraborty’s work, assessing its validity, and B) use our own methodology to recreate clustering done in an external program, the analysis is split into two sections. The first explores the correlation between county-level disability rate and COVID-19 incidence; if there is a relationship between the disability subcategories and incidence, as Chakraborty found, we should return significant correlation coefficients. The second compares our finished clusters to Chakraborty’s and considers why any differences might occur, given that we’re using a different program to generate our clusters. I’m not expecting the clusters to turn out exactly the same, so it will be interesting to interrogate why any differences arise.
Agreement was found between our mapped disability data and Chakraborty’s Figure 1, producing results without discernible differences. Summary statistics told a relatively similar story, with slight differences, likely due to rounding, that reflected minor methodological differences. Specifically, it appeared that Chakraborty rounded COVID-19 cases to the nearest whole person, a step I omitted in order to stay true to the data.
After performing Pearson’s correlation on the variables in the data, following Chakraborty’s methods, I subtracted the resulting significance values from those presented in the corresponding table in the original paper, which was loaded for comparison. They are mostly identical, with similar significance thresholds but slightly different actual values, which could be attributable to the difference in rounding methodology. The only variable that changed significance threshold was pct_18_34, which moved from significance level 1 to level 2.
Testing the correlations with a bivariate nonparametric test (Spearman’s rho), to assess the validity of Chakraborty’s use of a parametric test on a dataset that wasn’t normally distributed, revealed that several significance values changed: native_pct and other_pct gained significance, and pct_18_34 was reassigned to a lower (1-star) significance level.
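As an illustration of why the two tests can disagree on skewed data, here is a minimal self-contained sketch using simulated data (not the study variables): an exponential predictor mimics the right-skewed rates, and `cor.test()` is run both ways.

```r
# Simulated skewed data, standing in for the non-normal incidence rates
set.seed(42)
x <- rexp(100)                 # right-skewed predictor
y <- x + rnorm(100, sd = 0.5)  # noisy linear relationship
cor.test(x, y, method = "pearson")$estimate   # Pearson's r
cor.test(x, y, method = "spearman")$estimate  # Spearman's rho
```

Because Spearman's rho only uses ranks, it is insensitive to the skew that can inflate or deflate Pearson's r and its p-value.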
All in all, this section of the reproduction was a huge success. Despite the rounding issues, we were for the most part able to reproduce Chakraborty’s results, fulfilling our first goal of reproducing his workflow in open-source code. We were also able to add to the analysis by including a test for normality and, given the non-normal distribution of the data, performing a non-parametric correlation test. This test produced somewhat different results than Chakraborty’s original analysis, deepening the degree of analysis.
Implementation of the Kulldorff spatial scan statistic in the SpatialEpi package instead of SaTScan was less successful. Clusters were successfully pulled from the data, but they did not match the size and number of clusters from the original manuscript. We found more clusters (133 vs. 102) and larger clusters (up to 599 vs. 255 counties), calling into question the applicability of this package for reproducing SaTScan’s clustering operation. That said, the clusters produced do appear sensible given the underlying data, so this likely reflects a difference in software implementation rather than an incorrect application of the package.
This is the only preregistration for this study.
This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ)
Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. DOI:[10.1016/j.dhjo.2020.101007](https://doi.org/10.1016/j.dhjo.2020.101007)